This lab allows users to construct a NooJ corpus (.noc file) from a single text file.

A NooJ corpus is a set of texts that are processed together. All the text files must be in the same file format, and they are processed with the same set of linguistic resources (i.e. there is only one language module for a corpus).

(1) Enter the name of the original file, which will be split into the text files that constitute the corpus. For instance, one might enter the file HERALDTRIBUNE1994.TXT in order to split it into a 12-file corpus (one file per month), a 365-file corpus (one file per day), or even thousands of files (one file per article).

(2) Enter the original file format, for instance UTF8 (Unicode), or “Windows Western European” (under Other raw text formats).
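
The choice of format matters because the same bytes decode differently under different encodings. The sketch below is only an illustration outside NooJ: the function decode_raw_text and its label-to-codec table are hypothetical names, and it assumes that “Windows Western European” corresponds to code page 1252 (Python's "cp1252" codec).

```python
# Hypothetical mapping from the two format labels mentioned above to Python codecs.
# Assumption: "Windows Western European" means Windows code page 1252.
CODECS = {
    "UTF8 (Unicode)": "utf-8",
    "Windows Western European": "cp1252",
}

def decode_raw_text(data: bytes, file_format: str) -> str:
    """Decode raw bytes using the codec associated with the chosen format label."""
    return data.decode(CODECS[file_format])

# The same word stored in the Windows Western European encoding.
raw = "café".encode("cp1252")

print(decode_raw_text(raw, "Windows Western European"))

# Declaring the wrong format fails, because these bytes are not valid UTF-8.
try:
    decode_raw_text(raw, "UTF8 (Unicode)")
except UnicodeDecodeError as error:
    print("wrong format:", error.reason)
```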

There is an option to process PDF files. Warning: this functionality does not work with image-only PDF files (i.e. when the text is represented by a series of images rather than real text), and it does not work if the PDF is encrypted (i.e. protected against copying).

(3) Enter a Perl expression that identifies the boundaries between text units inside the original file. For instance, in order to split the newspaper file HERALDTRIBUNE1994.TXT into a set of articles, one needs to identify the header/title line that introduces each article. The following Perl expression:

^([A-Z]| )* \([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\)

would recognize headers such as:

A NEW PRESIDENT (1994-12-15)
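
A header expression can be checked against a few sample lines before it is entered in the lab. The sketch below does this in Python, whose re module accepts the same syntax for this particular pattern; it is not part of NooJ, and the function name is_header is hypothetical. The expression is written with the parentheses around the date escaped, \( and \), because unescaped parentheses in a Perl-style expression form a group instead of matching literal characters.

```python
import re

# Header expression with the parentheses around the date escaped,
# so that they match the literal "(" and ")" of the sample header.
HEADER = re.compile(r"^([A-Z]| )* \([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\)")

def is_header(line: str) -> bool:
    """Return True if the line starts with an upper-case title followed by a (YYYY-MM-DD) date."""
    return HEADER.match(line) is not None

print(is_header("A NEW PRESIDENT (1994-12-15)"))  # True
print(is_header("The president was elected."))    # False
```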

(4) Each article/unit will be stored in a different file. Choose the base name for all the files, as well as the number assigned to the first file. For instance, if the base name is “ht1994-” and the first file number is 1, then the file HERALDTRIBUNE1994.TXT will be split into:

ht1994-1.not, ht1994-2.not, ht1994-3.not, ht1994-4.not, ht1994-5.not, etc.
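
The splitting and naming scheme can be sketched as follows. This is a minimal illustration in Python, not NooJ's implementation: the function split_units and its parameters are hypothetical, lines that precede the first header are simply dropped (an assumption of this sketch), and the result is a mapping from file names to plain text, not actual files that NooJ can open.

```python
import re

# Header expression for lines such as "A NEW PRESIDENT (1994-12-15)".
HEADER = re.compile(r"^([A-Z]| )* \([0-9]{4}-[0-9]{2}-[0-9]{2}\)")

def split_units(text, base_name, first_number=1, extension=".not"):
    """Split text at header lines and map generated file names to unit contents."""
    units = []
    current = None  # no unit is open until the first header is seen
    for line in text.splitlines(keepends=True):
        if HEADER.match(line):
            current = []           # a header line opens a new unit
            units.append(current)
        if current is not None:    # assumption: text before the first header is dropped
            current.append(line)
    return {
        f"{base_name}{first_number + i}{extension}": "".join(lines)
        for i, lines in enumerate(units)
    }

sample = (
    "A NEW PRESIDENT (1994-12-15)\n"
    "First article text.\n"
    "MARKETS RALLY (1994-12-16)\n"
    "Second article text.\n"
)
files = split_units(sample, "ht1994-")
print(list(files))  # ['ht1994-1.not', 'ht1994-2.not']
```

Changing first_number only shifts the numbering: with first_number=10, the same sample yields ht1994-10.not and ht1994-11.not.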

The resulting files can be stored either in a NooJ corpus (a single .noc file) or in a folder.